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Abstract 

We  present  a  learning  algorithm  for  hidden  Markov  models  with  continuous  state  and  observa¬ 
tion  spaces.  All  necessary  probability  density  functions  are  approximated  using  samples,  along 
with  density  trees  generated  from  such  samples.  A  Monte  Carlo  version  of  Baum- Welch  (EM) 
is  employed  to  learn  models  from  data,  just  as  in  regular  HMM  learning.  Regularization  during 
learning  is  obtained  using  an  exponential  shrinking  technique.  The  shrinkage  factor,  which  deter¬ 
mines  the  effective  capacity  of  the  learning  algorithm,  is  annealed  down  over  multiple  iterations 
of  Baum-Welch,  and  early  stopping  is  applied  to  select  the  right  model.  We  prove  that  under 
mild  assumptions,  Monte  Carlo  Hidden  Markov  Models  converge  to  a  local  maximum  in  likeli¬ 
hood  space,  just  like  conventional  HMMs.  In  addition,  we  provide  empirical  results  obtained  in  a 
gesture  recognition  domain,  which  illustrate  the  appropriateness  of  the  approach  in  practice. 
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We  present  a  learning  algorithm  for  hidden  Markov  models  with 
continuous  state  and  observation  spaces.  All  necessary  probability 
density  functions  are  approximated  using  samples,  along  with  likelihood 
trees  generated  from  such  samples.  Our  representation  is  proven  to  be 
asymptotically  consistent  with  the  "true"  density.  A  Monte  Carlo  version  of 
Baum-Welch  (EM)  is  employed  to  learn  models  from  data,  just  an  in 
regular  HMM  learning.  Regularization  during  learning  is  obtained  using 
an  exponential  shrinking  technique.  The  shrinkage  factor,  which 
determines  the  effective  capacity  of  the  learning  algorithm  is  annealed 
down  over  multiple  iterations  of  Baum-Welch,  and  early  stopping  is 
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likelihood  space,  just  like  conventional  HMMs.  In  addition,  we  provide 
empirical  results  obtained  in  a  gesture  recognition  domain,  which 
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1  Introduction 

Over  the  last  decade  or  so,  hidden  Markov  models  have  enjoyed  an  enormous  practical  success  in 
a  large  range  of  temporal  signal  processing  domains.  Hidden  Markov  models  are  often  the  method 
of  choice  in  areas  such  as  speech  recognition  [28, 27, 42],  natural  language  processing  [5],  robotics 
[34,  23, 48],  biological  sequence  analysis  [17,  26, 40],  and  time  series  analysis  [16, 55].  They  are 
well-suited  for  modeling,  filtering,  classification  and  prediction  of  time  sequences  in  a  range  of 
partially  observable,  stochastic  environments. 

)^th  few  exceptions,  existing  HMM  algorithms  assume  that  both  the  state  space  of  the  envi¬ 
ronment  and  its  observation  space  are  discrete.  Some  researchers  have  developed  algorithms  that 
support  more  compact  feature-based  state  representations  [15, 46]  which  are  nevertheless  discrete; 
others  have  suecessfully  proposed  HMM  models  that  can  cope  with  real- valued  observation  spaces 
[29, 19, 48].  Kalman  filters  [21, 56]  can  be  thought  of  as  HMMs  with  continuous  state  and  action 
spaces,  where  both  the  state  transition  and  the  observation  densities  are  linear-Gaussian  functions. 
Kalman  filters  assume  that  the  uncertainty  in  the  state  estimation  is  always  normally  distributed 
(and  hence  unimodal),  which  is  too  restrictive  for  many  practical  application  domains  (see  e.g., 
[4, 18]). 

In  contrast,  most  “natural”  state  spaces  and  observation  spaces  are  continuous.  For  example, 
the  state  space  of  the  vocal  tract  of  human  beings,  which  plays  a  primary  role  in  the  generation 
of  speech,  is  continuous;  yet  HMMs  trained  to  model  the  speech-generating  process  are  typically 
discrete.  Robots,  to  name  a  second  example,  always  operate  in  continuous  spaces;  hence  their 
state  spaces  are  usually  best  modeled  by  continuous  state  spaces.  Many  popular  sensors  (cam¬ 
eras,  microphones,  range  finders)  generate  real-valued  measurements,  which  are  better  modeled 
using  continuous  observation  spaces.  In  practice,  however,  real-valued  observation  spaces  are 
usually  truncated  into  discrete  ones  to  accommodate  the  limitations  of  conventional  HMMs.  A 
popular  approach  along  these  lines  is  to  learn  a  code-book  (vector  quantizer),  which  clusters  real- 
valued  observations  into  finitely  many  bins,  and  thus  maps  real- valued  sensor  measurements  into 
a  discrete  space  of  manageable  size  [54].  The  discreteness  of  HMMs  is  in  stark  contrast  to  the 
continuous  nature  of  many  state  and  observation  spaces. 

Existing  HMM  algorithms  possess  a  second  deficiency,  which  is  frequently  addressed  in  the 
AI  literature,  but  rarely  in  the  literature  on  HMMs:  they  do  not  provide  mechanisms  for  adapting 
their  computational  requirements  to  the  available  resources.  This  is  unproblematic  in  domains 
where  computation  can  be  carried  out  off-line.  However,  trained  HMMs  are  frequently  employed 
in  time-critical  domains,  where  meeting  deadlines  is  essential.  Any-time  algorithms  [9, 58]  address 
this  issue.  Any-time  algorithms  can  generate  an  answer  at  any  time;  however,  the  quality  of  the 
solution  increases  with  the  time  spent  computing  it.  An  any-time  version  of  HMMs  would  enable 
them  to  adapt  their  computational  needs  to  what  is  available,  thus  providing  maximum  flexibility 
and  accuracy  in  time-critical  domains.  Marrying  HMMs  with  any-time  computation  is  therefore  a 
desirable  goal. 

This  paper  presents  Monte  Carlo  Hidden  Markov  Models  (MCHMMs).  MCHMMs  employs 
continuous  state  and  observation  spaces,  and  once  trained,  they  can  be  used  in  an  any-time  fashion. 
Our  approach  employs  Monte  Carlo  methods  for  approximating  a  large,  non-parametric  class  of 
density  functions.  To  combine  multiple  densities  (e.g.,  with  Bayes  rule),  it  transforms  sample  sets 
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into  density  trees.  Since  continuous  state  space  are  sufficiently  rich  to  overfit  any  data  set,  our 
approach  uses  shrinkage  as  mechanism  for  regularization.  The  shrinkage  factor,  which  determines 
the  effective  capacity  of  the  HMM,  is  annealed  down  over  multiple  iterations  of  EM,  and  early 
stopping  is  applied  to  choose  the  right  model.  We  prove  the  convergence  of  MCHMMs  to  a  local 
maximum  in  likelihood  space.  This  theoretical  results  justifies  the  use  of  MCHMM  in  a  wide  array 
of  applications.  In  addition,  empirical  results  are  provided  obtained  for  an  artificial  domain  and  a 
gesture  recognition  domain,  which  illustrates  the  robustness  of  the  approach  in  practice. 

The  remainder  of  this  paper  is  organized  as  follows.  Section  2  establishes  the  basic  terminol¬ 
ogy  for  generalizing  HMMs  to  continuous  state  and  observation  spaces.  We  will  then,  in  Section 
Section  3,  discuss  commonly  used  sampling  schemes  and  provide  an  algorithm  for  generating 
piece-wise  constant  density  trees  from  these  samples.  A  key  result  in  this  section  is  a  proof  of 
asymptotic  consistency  of  both  sample-based  and  tree-based  representations.  Section  4  describes 
our  approach  for  complexity  control  (regularization)  using  shrinkage,  followed  by  the  statement  of 
the  MCHMM  algorithm  in  Section  5.  Section  6  proves  the  MCHMM  convergence  theorem,  which 
states  that  under  mild  assumptions  MCHMMs  converge  with  high  probability.  Empirical  results 
are  described  in  Section  7,  which  specifically  investigates  MCHMM  in  the  finite  sample  case.  The 
empirical  results  show  that  MCHMMs  work  well  even  in  the  non-asymptotic  case.  Section  7  also 
provides  an  experiment  that  characterizes  the  relation  between  sample  set  size  (computation)  and 
accuracy.  Finally,  related  work  is  discussed  in  Section  8  and  the  paper  is  summarized  in  Section  9. 

2  Generalized  Hidden  Markov  Models 

This  section  introduces  generalized  hidden  Markov  models  (in  short:  GHMM).  GHMMs  gener¬ 
alize  conventional  hidden  Markov  models  (HMMs)  in  that  all  spaces,  state  and  observation,  are 
continuous.  Our  description  closely  follows  that  of  Rabiner  [43],  with  densities  replacing  finite 
probability  distributions  throughout.  Throughout  this  paper,  we  assume  all  event  spaces  and  ran¬ 
dom  variables  (e.g.,  state,  observations)  are  measurable.  We  also  assume  that  unless  otherwise 
specified,  all  probability  distributions  are  continuous  and  possess  continuous  density  functions. 
Further  below,  when  introducing  density  trees,  we  will  also  assume  that  densities  are  non-zero 
over  a  compact,  bounded  region,  and  that  they  obey  a  Lipschitz  condition. 

A  GHMM  is  a  partially  observable,  time-invariant  Markov  chain  with  continuous  state  and  ob¬ 
servation  spaces  and  discrete  time.  Let  x  denote  a  state  variable  (a  measurable  random  variable), 
defined  over  some  continuous  space  (e.g.,  for  some  k).  At  each  time  t  >  1,  the  HMM’s  state 
is  denoted  Xi.  Imtially,  at  time  t  =  1,  the  state  of  the  HMM  is  selected  randomly  according  to  the 
density  tt.  State  transitions  are  governed  by  a  conditional  probability  density,  denoted  ii{x'  |  x) 
and  called  state  transition  density.  Densities  are  measurable  function  over  the  set  of  Borel  sets, 
hence  the  Riemann  integral 

/  fi{x'  I  x)  dx'  (1) 

Jxq 

measures,  for  xq  <  xi,  the  probability  Pr{xo  <  x'  <  xi  \  x)  that  the  state  preceding  x  lies  in 

[a;o,a:i). 

In  HMMs  (and  thus  in  GHMMs),  state  cannot  be  observed.  Instead,  only  a  probabilistic 


Monte  Carlo  Hidden  Markov  Models 


3 


projection  of  the  state  is  observable.  Let  bf  denote  a  measurable  random  variable  that  models  the 
observation  at  time  t.  Observations  are  generated  according  to  a  probability  density  conditioned 
on  the  state  of  the  HMM  (called  the  observation  density),  denoted  i>{b  \  x)  .If  bo  <  bi. 


measures  the  probability  that  the  observation  b  is  in  [6o,  bi),  given  that  the  state  of  the  HMM  is  x. 
Thus,  a  generalized  HMM  is  uniquely  defined  trough  three  densities: 

A  =  {tt, //,!/}.  (3) 

Putting  computational  limitations  aside  for  the  moment,  knowledge  of  A  is  sufficient  to  tackle  a 
variety  of  interesting  practical  problems: 

•  Computing  distributions  over  states  and  observations  at  arbitraiy  points  in  time, 

•  Generating  representative  example  trajectories  in  state  and  observation  space, 

•  Determining  the  likelihood  of  example  trajectories  under  an  HMM,  and 

•  Classifying  data  sequences  generated  by  mixtures  of  labeled  HMMs. 

Algorithms  for  these  problems  are  described  in  detail  in  [43];  they  are  easily  transferred  from  the 
finite  to  the  continuous  case. 

In  practice,  the  densities  A  are  often  unknown  and  have  to  be  estimated  from  data.  The  data, 
denoted  d,  is  a  sequence  of  observations^,  denoted 

d  =  (4) 

Here  Ot  denotes  the  observation  at  time  t.  The  total  number  of  observations  in  d  is  T. 

The  well-known  Baum- Welch  algorithm  [2,  33,  43]  provides  a  computationally  efficient  and 
elegant  approach  for  learning  tt,  //,  and  u.  Baum- Welch  begins  with  an  initial  model,  denoted 
A^°).  It  iterates  two  steps,  an  E-step  and  an  M-step  (see  also  [12]).  In  the  n-th  E-step,  distribution 
for  the  various  state  variables  xt  are  computed  under  a  fixed  model  A(")  (with  n  >  0).  The  n-th 
M-step  uses  these  distributions  to  derive  a  new,  improved  model  A(”+^)  .  As  shown  for  example  in 
[33,  37],  both  steps  increase  the  data  likelihood  Pr{d  \  A),  or  leave  it  unchanged  if,  and  only  if,  a 
local  maximum  in  the  likelihood  function  has  been  reached. 

In  the  E-step,  distributions  are  computed  for  the  state  variables  xt  conditioned  on  a  fixed  model 


A  and  the  data  d.  Recall  that  in  the  discrete  case, 

cx^^\x)  -  Pr[xt  =  x\Oi,...,Ou>y^)  (5) 

I3\^\x)  =  Pr{Ot^,,...,OT\xt  =  x,\^^^)  (6) 

=  Pr{xt  =  X  \  d,X^'^^)  (7) 

x')  =  Pr{xt  =  x,  Xt^ri  =  x'  \d,  A^"))  (8) 


^For  simplicity  of  the  presentation,  we  only  present  the  case  in  which  the  data  consist  of  a  single  sequence.  The 
extension  to  multiple  sequences  is  straightforward  but  requires  additional  notation. 
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The  continuous  case  is  analogous;  however,  here  a  and  3  are  densities,  and  thus  may  be  larger  than 
1.  Following  [43],  these  densities  are  computed  incrementally;  a  is  computed  forward  in  time,  and 
/?  backwards  in  time  (for  which  reason  this  algorithm  is  often  referred  to  as  \ht  forward-backward 
algorithm.  Initially, 


o:q^^(x)  =  7r^”^(a:) 

/^P(x)  =  1 

and  for  all  other  at  and  j3t: 

M’^Hx)  =  J  I  .r')  iM^O,  \  x)  dx' 

P\^\x)  =  I pIi\{x')  I  .r)  ^^("'(0,+,  I  .r')  dx' 

Bayes  rule  governs  the  conditional  density  over  the  state  space  at  time  t: 

a^f^{x')  dx' 

Similarly,  the  state  transition  densities  are  computed  as 

a^f^\x)  n(-)(x'  I  .r)  I  x') 

I  -r)  I  x')  i3f!l\(x)  dx  dx 


II 


I 


(9) 

(10) 


(11) 

(12) 


(13) 


(14) 


This  computation  is  completely  analogous  to  the  finite  case,  replacing  conditional  probabilities  by 
conditional  densities. 

The  M-step  uses  and  {x,  x')  to  compute  a  new  model  using  the  maximum 

likelihood  estimator: 


TT^'^^^\x)  = 

H^^+^\x'\x)  = 

z/(”+i)(6|a;)  = 


7o”^(a;) 


(15) 

(16) 

(17) 


Here  Icond  denotes  an  indicator  variable  that  is  1  if  the  condition  cond  is  true,  and  0  otherwise.  A 
straightforward  result  is  the  convergence  of  GHMMs  under  approprate  conditions. 

Theorem  1.  (GHMM  Convergence  Theorem)  If  all  distributions  X  to  a  GHMM possess  den¬ 
sity  functions  that  are  Lipschitz  (and  differentiable)  in  all  variables,  then  for  almost  all  (measure  1) 
points  the  steps  outlined  above  do  not  decrease  the  probability  density  of  the  observation  sequence. 
They  do  not  improve  the  probability  density  of  the  observation  sequence  at  each  iteration  if,  and 
only  if,  the  distributions  at  the  beginning  of  an  EM  step  are  at  a  critical  point  (local  maximum, 
minimum,  or  saddle  point). 
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Proof.  Only  sketched  here  (see  also  [2,  20,  33],  and  see  [47]  for  an  extension  to  certain  real¬ 
valued  spaces).  For  finite-state  finite-observation  HMMs,  all  densities  and  variables  (//,  i/,  ir,  a, 
j5,  7,  and  0  can  be  implemented  by  vectors,  and  the  Baum- Welch  algorithm  has  been  proven  to 
converge  [2].  Juang  [19]  has  shown  convergence  to  local  maxima  for  HMMs  with  a  finite  number 
of  states  and  continuous  observations  where  the  observation  densities  are  a  mixture  of  log  concave 
or  ellipsoidal  symmetrical  densities.  In  particular,  the  class  of  ellipsoidal  symmetrical  densities 
includes  Gaussians.  A  GHMM  can  be  viewed  as  the  limit  as  the  number  of  Gaussians  is  allowed 
to  increase  to  infinity,  and  then  the  number  of  states  is  allowed  to  increase  to  infinity  in  Juang’s 
analysis.  Since  Gaussians  are  fine  bump  functions  any  continuous  v{0\x)  can  be  the  limit. 

K 

\\m  'Y^CkxhxiO)  (18) 

where  J2k  Ckx  =  1  and  bkx{0)  is  a  Gaussian. 

Juang’s  analysis  contains  the  following  forms  of  manipulation: 


Vy  /  f{x,y)dx  =  /  VYf{x,y)  dx 

Jx  Jx 

where  f{x,  y)  is  some  function  proportional  to  a  density, 
sufficient  assumption  for  these  statements  to  hold  true. 

3  Density  Approximation 

This  section  describes  sample-based  and  tree-based  methods  for  density  approximation.  The  no¬ 
tion  of  asymptotic  consistency  is  introduced  and  results,  along  with  error  bounds,  are  given  for  a 
popular  sampling  algorithm  and  a  tree  method.  Throughout  this  section,  we  assume  that  all  density 
functions  are  centered  on  a  compact  and  bounded  region  in  3?^  (for  an  arbitrary  k)  and  that  they 
are  Lipschitz. 

3.1  Sampling 

Samples  are  vedues  drawn  fi'om  the  domain  of  a  density  function,  where  each  sample  is  annotated 
by  a  non-negative  real  value  [30,  36,  57].  Sample  sets  are  (finite)  sets  of  samples,  annotated  by 
numerical  probability  values.  More  specifically,  let  /  be  a  probability  density  function,  and  let  N 
denote  a  positive  number  (the  cardinality  of  a  sample  set).  Then  a  sample  set  is  a  set 

N 

X  =  ...,(a;iv,P:rjv)}  withJ3l’®n  =  l  (21) 

n=l 

where  x„  e  dom(/)  and  px„  €  [0, 1]  for  all  n  with  1  <  n  <  N.  Sample  sets  can  be  thought  of 
as  discrete  distributions  over  the  event  space  {  x  i , . . . ,  x  jv }  and  probability  distribution  defined  by 
i.Pxi )  •  •  •  j  Pxjf}  [36, 41]. 


(20) 

Differentiability  of  all  densities  is  a 

□ 
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One  popular  sampling  method  is  called  importance  sampling,  which  was  originally  introduced 
by  Rubin  [44].  Importance  sampling  approximates  a  density  /  by  drawing  samples  x  from  a 
distribution  with  density  g,  and  assigning  a  weight  proportional  to  Obviously,  the  density 
g  must  be  non-zero  over  the  support  of  /,  i.e., 

f{x)  >  0  g{x)  >  0  (22) 

We  will  distinguish  two  special  cases: 

1 .  Likelihood-weighted  sampling.  If  g=f,  we  will  refer  to  the  sampling  algorithm  as  likelihood- 
weighted  sampling.  Here  we  draw  samples  according  to  /  and  assigns  equal  probability  to 
each  sample  in  the  sample  set  [22]: 


Xn  is  drawn  according  to  / 

Px„  =  (23) 

In  many  cases,  likelihood-weighted  sampling  can  easily  be  implemented  using  rejection  sam¬ 
pling  [36]. 

2.  Uniform  sampling.  If  p  is  uniform  over  (a  superset  of)  the  support  of  /,  we  will  call  the 
sampling  algorithm  uniform  sampling.  This  sampling  algorithm  draws  values  randomly  from 
the  support  of  /  (which  is  assumed  to  be  bounded),  and  assigns  to  each  value  x  a  probability 
proportional  to  its  density  f{x) : 


Xn  is  drawn  uniformly  from  dom  (/) 

■  AT  1-1 

Pxn  =  fiXn) 

.i=l 


(24) 


Uniform  sampling  covers  the  domain  of  a  density  uniformly  regardless  of  the  nature  of  the  density 
function.  Likelihood-weighted  sampling  populates  the  space  according  to  /,  so  that  the  density 
of  sample  is  proportional  to  the  density  /.  Likelihood- weighted  sampling  can  be  said  to  “waste” 
fewer  samples  in  low-likelihood  regions  of  /  [41].  In  other  words,  if  Xu  and  Xiw  are  sample  sets 
generated  from  the  same  distribution  using  uniform  sampling,  and  likelihood-weighted  sampling, 
respectively,  the  following  holds  true  in  expectation: 

^  (25) 

{x,Px)eXu  (x,Px)€Xiy, 

were  the  equality  holds  in  expectation  if  and  only  if  /  is  uniform  [51].  Henceforth,  we  will  focus 
our  attention  on  likelihood-weighted  sampling  whenever  we  have  an  explicit  representation  of  a 
probability  distribution  (and  importance  sampling  otherwise). 

Figure  la  shows  a  sample  set  drawn  by  likelihood- weighted  sampling  from  a  distribution  that 
resembles  the  shape  of  a  sine  wave  in  2D.  All  probabilities  of  the  sample  set  shown  there  are 
the  same,  and  the  samples  are  concentrated  in  a  small  region  of  the  5?^.  In  practice,  likelihood- 
weighted  sampling  is  often  given  preference  over  uniform  sampling — specifically  if  the  target 
density  is  known  to  only  populate  (with  non-zero  measure)  a  small  subspace  of  its  domain. 


Monte  Carlo  Hidden  Markov  Models 


7 


Fignre  1;  (a)  Data  set  (b)  partitioned  by  a  density  tree. 


Both  sampling  methods  can  equally  be  applied  to  sample  from  a  sample  set  (resampling).  Let 
X  be  a  sample  set.  Under  uniform  sampling,  a  new  sample  set  is  generated  by  randomly  drawing 
samples  (a;,  px)  from  X  with  a  uniform  distribution  (regardless  of  the  px  values).  The  p^^-values 
in  the  new  sample  set  are  then  re-normalized  so  that  they  add  up  to  1.  Under  likelihood- weighted 
sampling,  samples  are  drawn  from  X  according  to  the  (discrete)  probability  distribution  induced 
by  their  Par-values.  Each  sample  is  then  assigned  the  same  probability.  Sampling  from  sample  sets 
plays  an  important  role  in  the  Monte  Carlo  HMM  algorithm  described  below. 


3.2  Asymptotic  Consistency 


The  central  idea  of  sampling  is  to  represent  density  functions  by  finite  sample  sets,  which  are  com¬ 
putationally  advantageous.  So  what  does  it  mean  for  a  sample  set  X  to  “represent”  a  density  /? 
Obviously,  sample  sets  represent  discrete  distributions  and  thus  cannot  represent  distributions  that 
possess  densities  (hence  are  continuous).  However,  we  will  call  a  sampling  method  asymptotically 
consistent  if  for  N  oo,X  converges  to  /  with  probability  1  when  integrated  over  the  system  of 
half-open  Borel  sets: 


^  ^  ^Xo<X<.X\  Px 

{x,Px)£X 


w.p.  1 


(26) 


Recall  the  system  of  half-open  intervals  is  an  inducing  system  for  the  sigma  algebra  over  the  3?*; 
hence,  it  suffices  to  show  convergence  for  those  sets. 

A  key  observation  is  that  both  sampling  algorithms  are  asymptotically  consistent — ^as  are  all 
importance  sampler  with  f{x)  >  0  g{x)  >  0  [51].  This  is  formalized  for  the  more  im¬ 
portant  likelihood-weighted  sampling  technique  in  the  following  theorem,  which  also  provides  a 
probabilistic  error  bound  for  the  finite  case. 

Theorem  2.  The  likelihood-weighted  sampling  method  is  asymptotic  consistent.  For  finite  N, 
e  >  0  and  xq  <  xi,the  likelihood  that  the  error  between  a  density  /  and  a  sample  X  is  larger  than 
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e  can  be  bounded  as  follows: 


Pr 


^  Ixo<x<xi  Px-  f{x)  dx 

{x,P:r)eX 


(27) 


Thus,  according  to  the  theorem  the  error  between  /  and  its  sample  decreases  at  the  familiar  rate 

Proof.  The  convergence  follows  directly  from  the  Central  Limit  Theorem  (see  also  [13,  41, 
51]).  The  bound  follows  from  the  Hoeffding  bound,  under  the  observation  that  for  any  half¬ 
open  interval  [xq,  xi)  each  sample  can  be  viewed  as  a  zero-one  “guess”  of  the  size  of  the  area 

Q  m  dx.  □ 

More  generally,  the  convergence  rate  of  importance  sampling  is  in  O  (;^)  if  fracf{x)g{x)  < 
oo.  The  variance  of  the  error  depends  on  the  “mismatch”  between  /  and  g  (see  e.g.,  [51],  pg.  33). 
The  N  will  be  reduced  by  a  constant  factor  of  max^;  <  oo  for  appropriate  g{x). 


3.3  Density  IVees 

While  sample  sets  are  sufficient  to  approximate  continuous-valued  distributions,  they  differ  from 
those  in  that  they  are  discrete,  that  is,  even  in  the  limit  they  assign  non-zero  likelihood  to  only 
a  countable  number  of  points.  This  is  problematic  if  one  wants  to  combine  densities  represented 
through  different  sample  sets:  For  example,  let  /  and  g  be  two  density  functions  defined  over 
the  same  domain,  and  let  X  be  a  sample  of  /  and  Y  a  sample  of  g.  Then  with  probability  1, 
none  of  the  samples  in  X  and  Y  are  identical,  and  thus  it  is  not  straightforward  how  to  obtain  an 
approximation  of  their  product  /  •  g  from  X  and  Y ,  Notice  that  multiplications  of  densities  are 
required  by  the  Baum-Welch  algorithm  (see  e.g..  Equation  (13)). 

Density  trees,  which  are  quite  common  in  the  statistical  literature  [24,  35,  38,  39],  transform 
sample  sets  into  density  functions.  Unfortunately,  not  all  tree  growing  methods  are  asymptoti¬ 
cally  consistent  when  applied  to  samples  generated  from  a  density  /.  We  will  describe  a  simple 
algorithm  which  we  will  prove  to  be  asymptotically  consistent. 

Our  algorithm  annotates  each  node  in  the  tree  with  a  hyper-rectangular  subspace  of  dom(/), 
denoted  by  V  (or  Vi  for  the  i-th  node).  Initially,  all  samples  are  assigned  to  the  root  node,  which 
covers  the  entire  domain  of  /.  A  node  i  is  split  whenever  the  following  two  conditions  are  fulfilled: 

1.  At  least  y/N  samples  {x,px)  £  X  fall  into  in  Vi. 

2.  Its  depth,  i.e.,  its  distance  from  the  root  node,  does  not  exceed  log2  N\ . 

If  a  node  is  split,  its  interval  v  is  divided  into  two  equally  sized  intervals  along  its  longest  dimen¬ 
sion.  These  intervals  are  assigned  to  the  two  children  of  the  node.  Otherwise,  a  node  becomes  a 
leaf  node  of  the  density  tree. 

An  example  of  such  a  density  tree  defined  over  3?^  is  illustrated  in  Figure  2.  Here  the  root  node 
covers  the  entire  domain  of  /.  Each  child  covers  exactly  half  the  space  of  each  parent  node.  The 
areas  covered  by  each  leaf  are  mutually  exclusive;  their  union  covers  the  entire  space.  Figure  lb 
shows  a  density  tree  generated  for  the  sample  set  shown  in  Figure  la.  The  reader  may  notice 
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Figure  2:  Density  tree.  Each  branch  cuts  the  area  under  a  node  into  two  equally-sized,  rectangular  halves,  shown  in 
grey. 


that  the  criteria  of  the  conditions  above  have  been  chosen  to  facilitate  our  proof  of  asymptotic 
consistency;  in  practice,  a  much  wider  range  of  criteria  for  stopping  tree  growth  can  be  applied. 

Density  trees  define  density  functions  that  continue  discrete  sample  sets  into  Let  /  be  a 
density  function,  X  a  sample  drawn  from  this  density,  and  for  all  x  €  dom(/)  let  i(x)  denote  the 
leaf  whose  region  Vi  contains  x.  Furthermore,  let  d,  denote  the  relative  frequency  of  samples  in 
the  i-th  node  weighted  by  their  respective  p^-  values: 

d-j  =  XI  Px  (28) 

{x,px)ex 

Then  the  density  function  of  a  tree,  denoted  /,  is  defined  as  follows: 

f{x)  =  1^1^  (Va;  €  dom(/))  (29) 

The  numerator  of  (29)  describes  the  weighted  relative  frequency  that  a  sample  falls  into  the  interval 
of  the  node  i{x).  The  denominator  is  the  size  of  the  interval,  which  is  necessary  to  transform  a 
relative  frequency  into  a  density  over  the  interval 

The  density  function  f{x)  can  be  equally  defined  for  internal  nodes  (non-leaf  nodes).  This  will 
be  important  below,  where  estimates  at  dilferent  levels  of  the  tree  are  mixed  for  regularization. 
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3.4  Asymptotic  Consistency  of  Density  "Dees 

An  important  result  is  the  asymptotic  consistency  of  density  trees.  This  result,  which  justifies 
the  use  of  density  trees  are  non-parametric  approximation,  will  be  formalized  in  the  following 
theorem: 

Theorem  3.  Density  trees  are  asymptotically  consistent,  that  is,  for  all  xo,xi  e  dom(f)  with 
xo<  xi  and  for  any  density  tree  f{x)  generated  from  a  sample  set  X  generated  from  a  density  f, 
the  following  holds  true: 

^  rool 

;v!^oo  J  ~  J  f{^)dx  wp.  1.  (30) 


We  will  first  prove  a  related  result,  namely  that  of  pointwise  stochastic  convergence 

/(*)  =  f{x)  (Va:  e  dom(/))  w.p.  1  (31) 

which  under  the  assumptions  made  in  this  paper  implies  the  theorem.  The  proof  is  carried  out  in 
two  stages,  each  of  which  is  described  by  its  own  lemma. 

Lemma  1.  The  density  f  converges  to  a  function  f  which  is  defined  through  the  same  tree  as 
/,  but  with  the  weighted  relative  frequencies  ai  replaced  by  their  true  frequencies,  denoted  di: 


Lemma  2.  The  density  f  converges  to  /. 

Proof  of  Lemma  1 .  For  each  leaf  i,  the  Hoeffding  bound  states  that 
Pr{\di  -  Oil  >  e)  <  2  (33) 

Our  tree  growing  rule  limits  the  depth  of  the  tree  to  [  |  logj  N J .  Hence,  there  are  at  most 

=  iV4  n4t 


nodes  in  the  tree.  Thus,  the  probability  that  there  exists  a  leaf  whose  empirical  frequency  di 
deviates  from  the  true  probability  d,  is  bounded  by 

Pr(3  leaf  i :  |<fj  -  d-il  >  £•)  <  2N*e~'^^^^.  (35) 


The  error  of  the  empirical  frequency  estimates  is  now  translated  into  density  errors.  Recall  that 
according  to  (29),  the  relative  frequency  is  related  to  the  density  of  a  tree  by  the  volume  Vi  of  each 
leaf.  Consequently, 

\f{x)-f{x)\  =  \Vi\-'^  \di  -  Oil  (36) 


Observing  that  the  volume  of  the  interval  covered  by  a  leaf  |V;  |  is  at  least  N~4 — ^which  directly 
follows  from  the  depth  limit  imposed  when  growing  density  trees — ^we  obtain 

\f{x)-f{x)\  <  N-i\di-ai\ 

^  iVTi/»-/>)|  < 


(37) 

(38) 
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Substituting  this  into  (35)  yields: 

Pr{\f{x)  -  f{x)\  >  ivie)  <  2  ivi  (39) 

which  with  the  substitution  e'  =  N^e  leads  to  the  following  error  bound  between  /  and  / : 

Pr{\f{x)- f{x)\>  e')  <  (40) 

Obviously,  the  right  hand-size  of  (40)  converges  to  0  as  iV  ^  oo,  which  proves  pointwise  conver¬ 
gence  of  /to/.  □ 

It  remains  to  be  shown  that  /  converges  of  /. 

Proof  of  Lemma  2.  It  suffices  to  show  that  with  high  likelihood,  any  leaf  i  that  covers  an  inter¬ 
val  with  non-zero  measure,  i.e.. 


(Ti  >  0 


(41) 


will  be  split  infinitely  often  as  AT  is  increased.  The  desired  convergence  then  follows  directly  from 
the  fact  that  /  obeys  a  Lipschitz  condition.  The  proof  is  given  for  likelihood-weighted  sampling. 

Let  i  be  a  leaf  node,  and  let  depth  (e)  denote  its  depth.  Furthermore,  let  Ui  be  the  number  of 
samples  in  node  i: 

Ui  =  ^  4ev;  =  N  ai  (42) 


The  seeond  equality  exploits  the  assumption  that  samples  are  generated  by  likelihood-weighted 
sampling.  Recall  that  the  node  i  is  split  if  Ui  >  y/N,  for  sufficiently  large  N.  Without  loss  of 
generality,  let  us  assume  that 


N  >  max 


/  ^  ^depth(i)  1 

I  {ai  -s)^  '  J 


and 


e  <  (T; 


(43) 


The  second  term  in  the  max  ensures  that  the  depth  limit  log2  N\  in  the  tree  growing  rule  does 
not  restrict  splitting  node  i.  The  other  assumptions  (43)  imply  that 


(Ti  > 


(44) 


The  Hoeffding  bound  now  yields  the  desired  result: 


Pr{(7i  -  0 

■i  >  £) 

(45) 

Pr 

f  1 

\Vn 

-he  - 

CFi  ^ 

(46) 

Pr 

> 

< 

(47) 

Pr 

fJ- 

\yfN 

>  — 1 
N) 

1  < 

(48) 

Pr 

[VN  >  n,) 

(49) 
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Thus,  with  high  probability  n,  >  y/N  and  node  i  is  split.  The  pointwise  convergence  of  /  to  / 
now  follows  from  our  assumption  that  /  is  Lipschitz.  □ 

Proof  of  the  Theorem  3.  We  have  already  shown  that 

^i^  f{x)  =  f{x)  w.p.  1  and  Va;  €  dom(/).  (50) 

Theorem  3  follows  trivially  from  the  triangle  inequality,  since  dom(/)  is  bounded  and  Lipschitz 
(which  implies  that  /  is  bounded).  □ 

The  termination  conditions  for  growing  trees  (see  itemized  list  in  Section  3.3)  where  chosen 
to  facilitate  the  derivation  of  Theorem  3.  In  practice,  these  conditions  can  be  overly  restrictive, 
as  they  often  require  large  sample  sets  to  grow  reasonable-sized  trees.  Our  actual  implementation 
sidesteps  these  conditions,  and  trees  are  grown  all  the  way  to  the  end.  While  the  convergence 
results  reported  here  are  not  applicable  any  longer,  our  implementation  yielded  much  better  per¬ 
formance  specifically  when  small  sample  sets  were  used  (e.g.,  100  samples). 

4  Regularization  Through  Shrinkage  and  Annealing 

We  will  now  resume  our  consideration  of  GHMMs.  The  continuous  nature  of  the  state  space  in 
GHMMs,  if  represented  by  arbitrary  sample  sets  and/or  trees,  can  easily  overfit  any  data  set,  no 
matter  how  large.  This  is  because  GHMMs  are  rich  enough  to  assign  a  different  state  to  each 
observation  in  the  training  data  (of  which  there  are  only  finitely  many),  maVing  it  essentially 
impossible  to  generalize  beyond  sequences  other  than  the  ones  presented  during  training.  A  similar 
problem  arises  in  conventional  HMMs,  if  they  are  given  more  states  than  samples  in  the  data  set. 
In  GHMMs  the  problem  is  inherent,  due  to  the  topology  of  continuous  spaces.  Thus,  some  kind 
of  regularization  is  needed  to  prevent  overfitting  fi*om  happening. 

Our  approach  to  regularization  is  based  on  shrinkage  [50].  Shrinkage  is  a  well-known  statisti¬ 
cal  technique  for  “lumping  together”  different  estimates  fi-om  different  data  sources.  In  a  remark¬ 
able  result  by  Stein  [50],  shrinking  estimators  were  proven  to  yield  uniformly  better  solutions  over 
unbiased  maximum-likelihood  estimators  in  multivariate  Gaussian  estimations  problems  (see  also 
[53]).  Shrinkage  trees  were  introduced  in  [32].  Instead  of  using  the  density  estimates  at  the  leafs 
of  a  tree,  shrinkage  trees  mix  those  densities  with  densities  obtained  further  up  in  the  tree.  These 
internal  densities  are  less  specific  to  the  region  covered  by  a  leaf  node;  however,  their  estimates 
are  usually  obtained  from  more  data,  making  them  less  susceptible  to  variance  in  the  data. 

Figure  3  shows  an  example  using  shrinkage  with  an  exponential  factor,  parameterized  by  p 
(with  0  <  /9  <  1).  Here  each  node  in  the  tree,  with  the  exception  of  the  root  node,  weighs  its 
own  density  estimate  by  (1  —  p),  and  mixes  it  with  the  density  estimate  from  its  parent  using  the 
weighting  factor  p.  As  a  result,  every  node  along  the  path  contributes  to  the  density  estimate  at  the 
leaf;  however,  its  influence  decays  exponentially  with  the  distance  from  the  leaf  node.  Obviously, 
the  value  of  p  determines  the  amount  of  shrinkage.  If  p  =  1,  only  the  root  node  is  consulted, 
hence,  the  probability  density  induced  by  the  tree  is  uniform.  If  p  =  0,  on  the  other  hand,  there  is 
no  shrinking  and  only  the  estimates  in  the  leaf  nodes  determine  the  shape  of  the  density  function. 
For  intermediate  values  of  p,  estimates  along  the  entire  path  are  combined. 

Since  the  optimal  value  of  p  is  problem-specific — de  facto  it  depends  on  the  nature  of  the 
(unobservable)  state  space — our  approach  uses  annealing  and  cross  validation  to  determine  the 
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nodei 


Figure  3:  Shrinkage  for  complexity  control.  The  density  of  each  node  is  an  exponentially  weighted  mixture  of  densities 
of  the  nodes  along  the  path  to  the  leaf.  Shown  here  is  an  example  for  node  m,  whose  path  leads  through  nodes  z,  j,  k, 
and  1.  The  p-terms  describe  the  mixture  coefficients  for  this  example.  If  /o  is  0,  only  the  leaf  estimate  is  used.  For  larger 
values  of  p,  estimates  from  data  beyond  the  leaf  data  are  used  to  estimate  the  target  density. 


best  value  for  p.  More  specifically, 

p(”)  =  p"  (51) 

where  p  <  1  is  a  constant  (e.g.,  0.9)  and  n  denotes  the  iteration  counter  of  the  Baum-Welch 
algorithm  (starting  at  0).  Thus,  p  starts  at  is  maximum  value  (at  which  nothing  can  be  learned, 
since  every  density  is  uniform),  which  is  then  annealed  towards  zero  in  an  exponential  fashion. 
Cross  validation  (early  stopping)  is  applied  to  determine  when  to  stop  training. 


5  Monte  Carlo  HMMs 

We  are  now  ready  to  present  the  main  algorithm  of  this  paper,  along  with  the  main  theoretical 
result:  The  Monte  Carlo  algorithm  for  GHMMs,  called  Monte  Carlo  hidden  Markov  models  (in 
short  MCHMM).  A  MCHMM  is  a  computational  instantiation  of  a  GHMM  that  represents  all 
densities  through  samples  and  trees.  It  applies  likelihood-weighted  sampling  for  forward  and 
backward  projection  (c.f..  Equations  (1 1)  and  (12)),  and  it  uses  annealing  and  cross-validation  to 
determine  the  best  shrinkage  factor.  To  ensure  convergence,  the  number  of  samples  N  is  increased 
over  time. 
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Table  1:  The  MCHMM  algorithm  at-a-glance. 

Model  Initialization:  Initialize  A  =  {tt,  /r,  j/}  by  three  randomly  drawn  sets  of  samples  of  the  appropriate 
dimension.  Generate  density  trees  from  these  samples.  Set  p  =  1,  and  chose  an  initial  sample  set  size 
AT  >  0. 

E-step: 

1.  Copy  the  sample  set  representing  n  into  ao  (c.f..  (9)). 

2.  Set  Pt  ~  1* 

3.  Computation  of  at  (c.f.,  (1 1)).  For  each  t  with  1  <  <  T  do: 

(a)  Generate  N  samples  {x^ ,Px»)  from  the  sample  set  representing  using  likelihood- 
weighted  sampling. 

(b)  For  each  sample  {x\  p^/),  generate  the  conditional  density  ii[x  |  a?')  using  the  tree-version  of 
p.  Sample  a  single  x  from  this  tree,  using  likelihood-weighted  sampling. 

(c)  Set  px  to  a  value  proportional  to  v{Ot  \  x),  where  Ot  is  the  t-th  observation  in  the  data  set. 
This  density  value  is  obtained  using  the  tree  representing  i/, 

(d)  Generate  a  tree  from  the  new  sample  set. 

4.  Computation  of  Pt  (c.f.,  (12)).  For  each  t  with  1  <  ^  <  T  do: 

(a)  Generate  N  samples  {x^p^f)  from  the  sample  set  representing  using  likelihood- 
weighted  sampling. 

(b)  For  each  sample  {x\  p^>),  generate  the  conditional  density  |  x)  using  the  tree- version  of 
p.  Sample  a  single  x  from  this  tree,  using  likelihood-weighted  sampling. 

(c)  Set  px  to  a  value  proportional  \  x'),  where  Ot+i  is  the  t  -f  1-th  observation  in  the  data 

set.  This  density  value  is  obtained  using  the  tree  representing  i/. 

(d)  Generate  a  tree  from  the  new  sample  set. 

5.  Computation  of  7^  (c.f.,  (13)).  For  each  t  with  1  <  ^  <  T  do: 

(a)  Generate  N/2  sample  from  at  by  likelihood  weighted  sampling  and  assign  to  each  sample 
{x,  Px)  a  probability  proportional  to  f3t{x),  using  the  tree  approximation  of  fit . 

(b)  Generate  N/2  sample  from  Pt  by  likelihood  weighted  sampling  and  assign  to  each  sampled 

{x  a  probability  px  proportional  to  (x) ,  using  the  tree  approximation  of 

M-step: 

1.  Estimation  of  the  new  state  transition  density  p  (c.f.,  (16)):  Pick  N  random  times  ^G{1,...,T~1} 
and  generate  samples  (x,  p^)  and  {x^ ,Px>)  from  7^,  and  7t+i ,  respectively,  by  likelihood-weighted 
sampling.  Add  ((a:,  a?') ,  N~'^)  into  the  sample  set  representing  p.  Generate  a  tree  from  the  sample 
set. 

2.  Estimation  of  the  new  observation  density  z/  (c.f.,  (17)):  Pick  N  random  t  G  {I, . .  .,r}  and 
generate  a  sample  {x,po/)  from  7t  by  likelihood-weighted  sampling.  Add  ((a?,  Ot),N~^)  into  the 
sample  set  representing  1/,  Generate  a  tree  from  the  sample  set. 

3.  Estimation  of  the  new  initial  state  distribution  tt  (c.f.,  (15)):  Copy  the  sample  set  70  into  tt.  Generate 
a  tree  from  the  sample  set. 

Annealing:  Set  p  <r-  pp.  Stop  when  the  likelihood  of  an  independent  cross-validation  set  is  at  its  maxi¬ 
mum. 


Sample  set  size:  Increase  JV. 
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The  learning  algorithm  for  MCHMM  is  described  in  detail  in  Table  1.  MCHMMs  use  both 
sample-based  and  tree  representations  during  learning.  After  learning,  it  suffices  to  store  only  the 
tree-based  version  of  the  model  A  =  {tt,  /x,  u]‘,  all  sample  sets  can  be  discarded.  When  applying 
a  trained  MCHMM  to  a  new  data  set  (e.g.,  for  analysis,  prediction,  or  classification),  only  the 
“forward”  densities  oit{x)  have  to  be  estimated  (just  like  in  conventional  HMMs,  Kalman  filters 
[21,  31],  or  dynamic  belief  networks  [8,  45]).  For  that,  no  trees  have  to  be  grown.  Instead, 
samples  for  at^i  (a;)  are  obtained  by  sampling  from  the  sample  set  representing  at{x)  (see  also 
Section  3.1)  and  the  tree  representing  /j..  The  pj  -values  of  a,+i  (.r)  are  determined  using  the  tree 
representing  i/.  This  recursive  resampling  technique,  known  as  sampling/importance  resampling 
[44,  49]  applied  to  time-invariant  Markov  chains,  converges  at  the  rate  1/ VN  (if  T  is  finite,  c.f.. 
Section  3.2).  Sampling/importance  resampling,  which  will  further  be  discussed  in  Section  8,  has 
been  suecessfully  applied  in  domains  such  as  computer  vision  and  robotics  [11, 18]. 


6  Error  Analysis  and  Convergence  Results 

This  section  presents  the  major  convergence  result  for  MCHMM.  Under  mild  assumptions,  MCH¬ 
MMs  can  be  shown  to  converge  with  probability  1  to  models  A  that  locally  maximize  likelihood; 
just  like  conventional  HMMs.  The  proof  builds  on  the  well-known  results  for  discrete  HMMs  (see 
g.e.,  [2,  19,  20,  33,  47]),  and  shows  that  if  the  sample  size  N  is  sufficiently  large,  the  deviation 
between  the  MCHMMM  and  the  corresponding  GHMM  can  be  bounded  arbitrarily  tightly  with 
high  probability,  exploiting  the  asymptotic  consistency  of  the  two  major  approximations:  samples 
and  density  trees. 


6.1  Relative  Error 


The  process  of  Monte  Carlo  simulation  introduces  error  into  the  Baum- Welch  algorithm.  In  order 
to  understand  how  this  affects  convergence,  we  need  to  understand  how  an  initial  error  will  prop¬ 
agate  through  an  EM  step.  This  analysis  will  also  cover  errors  introduced  in  the  middle  of  an  EM 
step  as  long  as  such  errors  converge  to  0  with  large  samples,  because  errors  introduced  in  mid-step 
will  be  indistinguishable  from  an  error  introduced  at  the  beginning  of  the  calculation. 

What  we  would  like  to  prove  is  that  a  small  absolute  initial  error  will  imply  a  small  absolute 
final  error.  Unfortunately,  this  is  not  possible  because  several  calculations  are  normalized  integrals 
of  products.  Consider  the  integral  f{x)g(x)dx.  H  f{x)  =  step{x  —  1/2)  and^(a;)  =  1  —  f[x) 
then  fix)g{x)  dx  =  0.  However,  if  / {x)  =  f{x)  -|-  e  and  g{x)  =  g{x)  -f  e  are  integrated,  we 
get  a  non-zero  integral:  f{x)g{x)dx  =  e.  First,  notice  that  the  calculation  of  at{x)  is  of  this 

form.  Assume,  for  the  moment,  that  y3t(a;)  =  1.  Then7<(x)  =  o:t{x)/  J  ai{x')dx'.  Consequently, 
if  at{x)  was  small  (<  e)  everywhere,  jt{x)  can  potentially  be  very  far  from  7i(a:).  Furthermore, 
any  allowable  error  can  produce  an  arbitrarily  large  error  in  the  output. 

However,  if  we  have  a  small  relative  error. 


f{x)  -  f{x) 

fi^) 


(52) 


we  will  be  able  to  prove  that  a  small  initial  error  implies  a  small  final  error.  The  restriction  to 
small  relative  error  is  significant  because  density  trees  only  guarantee  a  small  absolute  error.  An 
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extra  assumption  must  be  placed  on  the  initial  distribution  in  order  to  guarantee  that  the  density 
tree  produced  by  sampling  from  the  distribution  will  have  small  relative  error. 

To  simplify  the  proof  that  a  small  initial  error  produces  a  small  final  error,  we  only  consider 
transforming  functions  to  first  order,  / (s)  ~  f'ix)  [x  —  x).  There  is  no  zeroth  order  error  because 
the  limit  as  the  relative  error  approaches  0  is  f{x),  lim^-^j;  f{x)  =  f{x).  Higher  order  error 
becomes  small  quickly  for  all  transforming  functions  which  we  consider.  Consequently,  we  will 
be  able  to  state  that  for  all  s,  5,  there  exists  some  number  of  samples,  N,  s.t.  f{x)  —  f{x)  < 
f'{x){x  -  x)  +  €  with  probability  1  -  <5. 

Our  analysis  does  not  consider  the  shrinkage  factor  p,  which,  since  lim„^oo  =  0,  has 
asymptotically  no  effect. 

6.2  Convergence  of  MCHMMs 

We  will  now  state  the  central  result  of  this  section:  the  MCHMM  convergence  theorem  and  an 
important  corollary.  The  proof  of  the  theorem  will  be  developed  throughout  the  remainder  of  this 
section. 

Theorem  4  (MCHMM  Convergence  Theorem).  If  approximation  of  the  underlying  distribu¬ 
tions  by  density  trees  causes  only  a  finite  relative  error  which  decreases  toOasN  oo  and  an  EM 

step  in  a  GHMM starting  with  produces  output  distributions 

then  for  alle,S  there  exists  an  N  s.t.  the  output  of  an  MCHMM  iteration  taking  ,  i/(")  as 

inputs  with  probability  1-S  satisfies <  £,  < 

s,  and  |r'^”+^)(Of|x)  —  <  e  where  and  are  the  output  dis¬ 

tributions  of  the  MCHMM  iteration. 

The  proof  of  Theorem  4  will  be  presented  below,  after  introducing  a  collection  of  useful  lem¬ 
mas.  However,  Theorem  4  implies  the  following  important  corollary: 

Corollary.  Under  the  same  assumptions  as  in  Theorem  4,  any  strictly  monotonically  increas¬ 
ing  schedule  of  N ’s  will  cause  a  MCHMM  to  converge  to  a  local  maximum  in  with  probability 
1. 

Proof  of  the  corollary.  The  convergence  of  GHMMs  is  a  straightforward  extension  of  the  well- 
known  convergence  proof  for  HMMs  as  outlined  in  the  proof  of  Theorem  1.  The  action  of  a 
GHMM  iteration  causes  movement  through  the  space  of  distributions.  We  can  view  this  action  of 
a  GHMM  iteration  as  inducing  a  vector  field  in  the  (infinite  dimensional)  space  of  distributions 
with  a  magnitude  proportional  to  the  change  in  likelihood  and  a  direction  given  by  (x,  x')  - 
A(”'^^)(x,  x').  The  action  of  an  MCHMM  with  probability  1  —  5  will  have  a  result  in  a  perturbed 
vector  with  the  error  bounded  by  e.  The  repeated  action  of  a  MCHMM  with  N  increasing  will 
create  a  Cauchy  sequence  where  the  e  decreases  to  0  and  S  decreases  to  0.  This  Cauchy  sequence 
will  limit  to  local  maxima  with  probability  1 .  □ 

It  is  important  to  notice  that  this  corollary  does  not  state  that  an  MCHMM  and  GHMM  starting 
with  the  same  distribution  will  converge  to  the  same  local  maximum.  Such  a  result  will  gener¬ 
ally  not  hold  true,  as  the  finiteness  of  N  might  influence  the  specific  local  maximum  to  which  a 
MCHMM  converges. 

It  is  also  interesting  to  note  that  the  noise  introduced  by  MCHMMs  in  the  convergence  step 
may  be  beneficial.  The  argument  is  informal  because  the  noise  of  an  MCHMM  has  both  systematic 
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(due  to  representation)  and  random  (due  to  sampling)  effects.  First,  consider  a  GHMM  which 
accidentally  starts  out  at  a  local  minimum  (or  a  saddle  point).  The  GHMM  might  be  stuck  here, 
but  a  MCHMM  will,  with  probability  1,  step  off  of  the  local  minima  and  move  towards  a  local 
maxima.  In  addition,  consider  a  landscape  containing  many  local  maxima  on  it’s  flanks.  A  GHMM 
could  easily  become  trapped  in  a  local  maxima  while  an  MCHMM  with  a  low  N  will  be  kicked 
out  of  the  local  maxima  with  high  probability.  In  this  way,  an  MCHMM  could  gain  some  of  the 
benefits  of  simulated  aimealing. 


6.3  Propagation  of  Error 


Before  we  can  prove  the  MCHMM  convergence  theorem,  we  will  need  to  develop  an  analysis  of 
the  propagation  of  errors  through  transforming  functions. 

Let  us  start  by  assuming  that  we  have  some  initial  error  lf(x)  =  7r(a;)  +  A^(a;),  Jl{x\x')  = 
IJ,{x\x')  +  A^(a:,a;')  andv{0t\x)  =  i^(Ot|x)  +  Aj/(Oi,a;).  These  errors  can,  for  example,  be  the 
errors  introduced  by  approximating  a  distribution  with  density  trees.  It  will  also  be  convenient  to 
talk  about  the  relative  error 


K{Ot\x) 

AKy) 


Att  (3^) 

7r{x) 

A^(a;|a 

0 

H{x\x')  1 
A.(Otlx) 

i/(C)t|i 

A/(y)| 

0 

fiy) 


(53) 

(54) 

(55) 

(56) 


and  the  largest  relative  error  over  the  input  range 


a;  =  maxA;(a;) 

(57) 

A’’  =  maxA);(a;|a;') 

(58) 

Al  =  maxA;;(Ot|a;) 

XjOt 

(59) 

Ay  =  max  Ay  (y) 

(60) 

The  propagation  of  errors  through  individual  calculations  of  Baum-Welch  will  be  characterized  by 
taking  a  derivative.  Letx  =  a;+A(x).  The  definition  of  a  derivative  is  . 

As  the  difference  between  2  distributions,  x  and  x  approaches  0,  the  difference  between  die  re¬ 
sulting  calculations,  f{x)  and  f{x),  will  approach  f'{x)A{x)  implying  A/(x)  =  f{x)  —  f{x)  = 
f'{x)A{x)  in  the  limit  that  A  (a;)  0.  In  particular, 

—  A/ (a;)  -|-  Ag(x) 

A  /  N  _  -g(a;)A/(a;)  +  /(a;)Ag(a:) 

’  ~  f{x)^ 

A/5(a;)  =  f{x)Ag{x)  +  g{x)Af{x) 


(61) 

(62) 

(63) 
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^//  =  J^f  (64) 

The  last  statement  is  only  true  for  integration  over  compact  regions  with  finite  valued  f{x),  be¬ 
cause  f  f{x)  -  f{x)dx  =  f  f{x)dx  -  f  f{x)dx  under  these  assumptions.  These  assumptions 
were  already  made  to  prove  convergence  of  density  trees  so  they  are  not  an  additional  restriction. 


6.4  Error  Bounds  for  Auxiliary  Variables 


Given  the  above  rules  and  notation  we  can  calculate  how  the  error  will  propagate  through  the  cal¬ 
culation  of  a  at  (x) .  The  following  lemmas  establish  bounds  on  the  relative  errors  of  all  quantities 
calculated  in  a  Baum- Welch  iteration  given  some  initial  error.  These  lemmas  are  necessary  to 
prove  the  MCHMM  convergence  theorem. 

Lemma  3.  <  {t  -  1)(A^  -t-  AJ^)  -|-  A^  to  first  order. 

Proof.  According  to  the  property  (64), 


Aat  Aj' 


,(a;)  =  /  Ac,_,^j,(x)  dx 

Jx 

Using  (63),  this  expression  is  equivalent  to 

=  /  (at)  +  i/(i)a,_,(i)A„(a:)  +  /i(x)a,_i(i)A„(i)  dx 

=  /  /.(x)r(x)a,_,  (X)  ^  ix 

Jx  \  at-i{x)  i/{x)  /t(x)  y 

Notice  that  the  integral  can  be  bounded  by  a  maximum, 


Ao,  <  a't(x)  max 


Aqh_i  (x)  ^  A;x(x)  ^  A^(x) 
a't-i(x)  i/(x)  yu(x) 


and. 


Aot  >  — Q;t(a^)  max  ^ 


(x)  A^(x)  Afi(x) 


0!t-l(x)  l/(x)  /i(x) 


*0!t 


0!t(x) 


<  max 


( 


at-i(x) 


+ 


A^(x) 


iy(x) 


+ 


A^(x) 


jii(x) 


^at  ^  _ f 

^at-i  (^) 

1 

1 

A^(a:) 

✓  V  iXidiA.  1 

at{x)  X  \ 

z/(x) 

“T 

fj.(x) 

(65) 


(66) 

(67) 

(68) 

(69) 

(70) 

Consequently,  we  have  that  A^^  <  A^^  j  -t-  AJ^  -|-  A^  The  equation  is  recursive,  terminating  for 
t  =  0  with  A^j  =  A^.  At  each  step  in  the  recursion,  at  most  A^  +  A^  is  added  to  the  error  which 
implies 


Division  by  at{x)and.  application  of  the  triangle  inequality  yields  the  relative  error  bounds 


A;,<(t-i)(A::-hA;)  +  A; 

which  proves  Lemma  3. 


□ 
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The  calculation  of  the  relative  fSt  error  is  analogous,  and  the  resulting  bound  it  stated  in  the 
following  lemma: 

Lemma  4.  <  {T  -t)  ( +  AJ^)  to  first  order. 

The  proof  is  omitted,  since  it  is  similar  to  the  proof  for  AJ,  ^ . 

The  calculation  of  the  error  of  is  more  difficult. 

Lemma  5.  A!^  <  2(A^  +  AJg)  =  2(A!J.  +  (T  -  1)  (AJ]  +  AJ',))  to  first  order. 

Proof.  To  simplify  the  notation,  we  will  assume  a  /  subscript  in  the  following  proof.  The 
results  are  independent  of  t. 


A.y(a;)  =  A  og  (a:) 


ap 


J^a{x)l3{x)dx  -  Q;(a;)/?(.i 

_  l/3(x)Aa(x)  +  a(x)A0(x)]  o(a-).j(.r)d.r 

a(x)/3(x)  fJa(x)A0(.r)  +  .j(.r)Ac(.T)]da; 

[/3(x)Aa(x)  +  a(x)A/^(x)]  a(x):3i.r)dx 
a(x)/3(x)[f^  a(x)/3(x)dx]  max^ 


< 


+ 


a(a?) 


<  a(x)j3(x) 


Aa(g)  I  A^(3;)1  ,  Aa(j)|\ 

a(x)  ^  /3{x)  j  ^  ^  \|  /3(x)  ^  a(x)  \) 


f^a{x)/3{x)dx 


(71) 


Through  similar  manipulations,  we  get: 


A^(a;)  > 


—a{x)0{x) 


Ao[(a7)  1  A/3(a7) 

Q;(a7)  ’  I3(x) 

+  max^  ^ 

I3(x) 

+ 

Aa(x) 

a(x) 

) 

J 

f^a{x)l3{x)dx 

Now,  we  can  divide  by  a  factor  of  7  to  get 


(72) 


a;  <  2(a;  +  a^)  (73) 

and  apply  Lemmas  3  and  4  for  any  value  of  t  to  get 

a;  <  2(A;  +  (r-i)(A;;  +  A:;))  (74) 

which  completes  the  proof.  □ 

The  bounds  for  A^  are  similar  to  those  of  A!^,  as  documented  by  the  following  lenraia. 
Lemma  6.  A^  <‘^{A’^  +  {T  -  1)  (A^  +  A^))  to  first  order. 

The  proof  of  Lemma  6  is  analogous  to  that  of  Lemma  5  and  therefore  omitted.  Notice  that  the 
bounds  of  A^  and  A!y  are  independent  of  t. 
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The  M  step  calculations  are  done  similarly.  Let  tt'  =  the  next  tt  distribution,  /x'  =the  next  // 
distribution,  and  v'  =  the  next  u  distribution.  Then  the  following  lemmas  provide  bounds  for  the 
error  of  tt',  //',  and  v'. 

Lemma  7.  A^,  =  2(A’^  +  (T  -  1)  (A^  +  A^))  to  first  order. 

Proof.  Lemma  7  follows  from  Lemma  5,  since  A^,  =  Al^^ . 

Lemma  8.  Ap  =  4(A^  +  (T  -  1)  (A^  +  A^))  to  Srst  order. 

By  property  (62), 


/T-l 


~  7tix)j  Ay^^{x,x')  +  ^tix,x')j  A^^(a;)' 

-  (E7-w) 

/T-l  \  /T-l 

+  I]6(a;,a;0  A^,(x) 


< 


T-l 

.t=l 
T-l 

Ee 


a=i 

^T-l 
^=1  ‘ 
^T-1 


,t=l 
^T-1 


E;.ri‘A;,(»,»o  el,  a„(i) 


1 


Lt=l 


T:i=i^^t{x,x')  '  \  El=i^itix) 

I  ^7t  ^ 


max 

t 


+ 


and,  analogously, 

A^i>{x'\x)  > 

These  equations  are  of  the  form: 

Ani{x'\x)  <  iJ,{x'\x)  ma,x 

<  \  It  J 


T-l  • 

E 

Lt=l 


*  Tt  /  Et=i  Tt 

V  6 


(75) 


7t  y  EfjiSt 


and 


A^/(a;'|a;)  >  —^{x'\x)max  f-^^  + 

*  V  Zt 


-*7t 

Tt 


(76) 


(77) 


(78) 


which  implies  that  A^,  <  (A^  +  A!;).  The  lemma  follows  from  the  bounds  for  A!;  and  A^ 
stated  in  Lemma  5  and  6.  □ 

Lemma  9.  Al,  =  4(A;  +  (T  -  1)  (A;;  +  A^)))  to  first  order. 

The  proof  is  analogous  to  the  proof  of  Lemma  8. 

Theorem  5.  The  error  introduced  in  a  round  of  Baum-Welcb  with  some  initial  AJJ.,  AJ^,  and 
a;:  to  first  order  is  a;,  =  2{{T  -  1)(A;:  +  A^)  +  A;),  A;;,  =  4((T  -  1)(A;:  +  Ap  +  a{),  and 
A:,=mT-l){Al  +  Al)  +  Al). 

Proof.  Theorem  5  follows  directly  from  Lemmas  7  to  9.  □ 
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6.5  Implications  for  MCHMMs 

The  analysis  of  error  propagation  for  GHMMs  implies  that  as  N  ^  oo  and  the  relative  errors 
approach  0  the  error  of  a  step  in  the  generalized  Baum-Welch  algorithm  tends  towards  the  first 
order  errors:  \/s  >  0  :  3N  :A;,  =  2((T  -  1)(A;;  +  Ap  +  A;)  +  A;,  =  4((T  -  1)(A;:  + 

Ap  +  A^)  +  s,  and  A^,  =  4((!r  - 1)  (AJ^  +  A][^)  +  AJJ.)  +  £.  These  first  order  errors  also  converge 
to  0. 


Jim  Al, 

=  0 

ON 

N-^OO 

lim  A„/ 

N^co  ^ 

=  0 

(80) 

lim  a;, 

=  0 

(81) 

JV^oo  ^ 

We  are  now  ready  to  present  the  central  result  of  this  section,  a  proof  of  convergence  for  MCH¬ 
MMs. 

Proof  of  Theorem  4  (MCHMM  Convergence  Theorem).  According  to  Theorem  5,  the  first  or¬ 
der  relative  errors  of  and  are  linear  in  the  relative  errors  of 

and  By  assumption,  in  the  limit  as  N  oo  these  errors  go  to  0  with  probability  1,  implying 
the  theorem.  □ 

MCHMMs  converge  to  a  local  maximum,  under  the  additional  assumption  that  the  relative 
error  of  all  distributions  converge  to  0  as  A  oo.  Notice  that  the  p  parameter  was  not  used 
here — ^the  p  parameter  selects  between  various  models,  while  we  only  proved  convergence  within 
one  model.  However,  since  p  — >  0,  its  effect  vanishes  over  time. 

The  MCHMM  Convergence  Theorem  establishes  the  soundness  of  our  algorithm,  and  shows 
that  MCHMMs  can  indeed  be  applied  for  learning  a  large  range  of  non-parametric  statistical  mod¬ 
els  with  real-valued  state  and  observation  spaces.  Empirical  results  using  finite  sample  sets,  de¬ 
scribed  in  the  next  section,  suggest  that  the  algorithm  is  stable  even  for  small  sample  set  sizes,  and 
indeed  maximizes  data  likelihood  in  a  Monte  Carlo-fashion. 


7  Experimental  Results 

We  have  applied  MCHMMs  to  two  problems,  an  artificial  one  which  was  chosen  for  demonstration 
purposes,  and  a  more  difficult  real-world  gesture  recognition  problem.  The  experiments  address 
the  following  questions: 

•  Do  MCHMMs  converge  empirically?  If  so,  how  good  are  the  resulting  MCHMMs? 

•  What  accuracies  can  be  obtained  when  using  MCHMMs  for  discrimination? 

•  How  does  the  sample  set  size  affect  computational  and  accuracy  trade-offs? 

The  first  data  set,  called  the  noisy  oscillation  dataset,  consists  of  multiple  sequences  that  basi¬ 
cally  oscillate  around  two  points.  Observations  are  10-dimensional  and  governed  by 


O 


(0.25  -|-  €tp,  0.25  -|-  Gt,2)  0.25  -|-  £<,3, . . . ,  0.25  -|-  £^<,io)  if  I 

(0.75  -|-  s-tp,  0.75  -|-  £t,2y  0.75  -|-  £t,zi  •  •  •  >  0.75  -|-  £'(,io)  if  I  even 
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Figure  4:  Examples  of  gestures.  The  first  two  gestures  were  drawn  counterclockwise;  whereas  the  other  two  were 
drawn  clockwise.  The  MCHMM  learning  task  is  to  differentiate  them. 


where  €t,i  (with  1  <  f  <  20  and  1  <  t  <  10)  are  independent  and  identically  distributed  noise 
variables  with  zero-centered  triangular  distribution  over  [-0.15: 0.15].  To  test  the  discriminatory 
accuracy  of  the  learned  model,  we  also  generated  a  second,  similar  data  set: 

(0.25-het,i,0.75  +  £t,2,0.25  +  ei,3 . 0.75-hf<,io)  if  ^  odd 

(0.75-fet,i,0.25-Fe:t,25)0.75-f  ^(,3 - 0.25-|-5i,io)  if  i  even  ^ 

using  new,  independent  noise  variables.  Notice  that  despite  the  fact  that  there  is  a  good  amount 
of  noise  in  the  data,  these  data  sets  are  relatively  easy  to  discriminate,  since  their  observations  fall 
into  different  regions  in  the 

The  second  data  set  consisted  of  a  collection  of  hand-drawn  gestures,  represented  in  3?^.  Fig¬ 
ure  4  shows  examples.  Once  drawn,  all  gestures  in  our  data  base  look  quite  similar.  However, 
some  of  the  gestures  were  drawn  clockwise,  whereas  others  were  drawn  counterclockwise.  Here 
we  are  interested  in  discriminating  clockwise  from  counterclockwise  gestures.  Notice  that  this 
problem  is  more  difficult  then  the  artificial  one,  as  the  observations  alone  (stripped  of  their  tem¬ 
poral  order)  are  insufficient  for  discrimination;  instead,  the  MCHMM  has  to  learn  a  meaningful 
model  of  the  internal  state. 

Figures  5  and  6  show  results  obtained  for  the  &st  dataset.  Shown  in  both  figures  are  curves 
that  characterize  the  “log  score”  as  a  function  of  the  number  of  iterations.  This  score  is  the  real¬ 
valued  analogue  to  the  likelihood;  however,  since  densities  can  be  larger  than  one,  the  score  can 
also  be  larger  than  one,  and  hence  its  logarithm  is  not  bounded  above. 

The  curves  in  Figure  5  show  average  log  score  for  the  training  set  as  a  function  of  the  number 
of  iterations,  averaged  over  10  independent  experiments  (each  curve).  The  different  curves  were 
obtained  using  different  numbers  of  samples  which,  contrary  to  the  theoretical  results,  were  kept 
constant  throughout  the  experiments: 


Sample  set  size  N  for. . . 

a,l3,-y,n 

top  curve: 

1,000 

1,000 

middle  curve: 

1,000 

100 

bottom  curve: 

100 

10 

Figure  5  also  provides  95%  confidence  bars  for  these  results.  The  difference  in  absolute  levels  to 
which  the  scores  appear  to  converge  stem  from  the  fact  that  the  maximum  tree  depth,  and  hence 
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Figure  5:  Log  score  of  training  data  as  a  function  of  the  iteration  of  EM,  for  the  synthetic  data  set  Top  curve:  1000 
samples  for  aU  densities.  Middle  curve:  1000  samples  for  n  and  v,  100  samples  for  a,/3,‘(,  and  tt.  Bottom  curve:  100 
samples  for  and  v,  10  samples  for  a,  j3,  7,  and  tt.  These  graphs  illustrate  that  MCHMMs  tend  to  maximize  the  data 
likelihood,  even  in  the  finite  case.  Each  result  is  averaged  over  10  independent  experiments;  95%  confidence  bars  are 
also  shown 


the  maximum  of  the  tree  density  /,  is  a  function  of  the  sample  set  size;  hence,  if  the  number  of 
samples  is  small,  the  tree  will  remain  shallow.  The  key  observation,  however,  is  their  monotonicity. 
These  curves  demonstrate  that  each  iteration  in  MCHMM  indeed  increases  the  data  likelihood  (or 
leaves  it  unchanged),  even  if  only  finitely  many  samples  are  used.  This  finding,  which  we  also 
consistently  observed  for  other  data  sets  and  parameter  settings,  illustrate  that  MCHMMs  are 
well-behaved  in  practice,  that  is,  the  finiteness  of  the  sample  set  appears  not  to  cause  catastrophic 
problems. 

Figure  6  shows  results  for  independent  testing  data.  The  upper  curve  in  Figure  6  depicts  the  log 
score  for  set  of  random  sequence  that  are  generated  from  the  same  model  as  the  training  sequences. 
The  bottom  curve  depicts  the  log  score  for  independently  generated  sequences  using  the  other 
model,  for  which  the  MCHMM  was  not  trained.  In  both  cases,  only  a  single  data  set  (of  length 
20)  was  used  for  training,  and  testing  was  performed  using  20  data  sets  generated  independently. 
N  =  100  samples  were  used  throughout  for  all  densities  a,  /?,  7,  and  tt,  and  =  1, 000  samples 
were  used  for  /i  and  i/,  to  account  for  their  higher  dimensionality.  The  initial  shrinkage  factor  was 
p  =  1,  which  was  annealed  down  at  the  rate  p  =  0.9.  In  all  experiments,  the  dimension  of  the 
state  space  was  k  —  2.  The  results  obtained  for  lower-dimensional  observation  spaces,  which  are 
not  shown  here,  were  even  better.  In  all  cases,  the  score  clearly  reflected  that  the  MCHMM  had 
learned  a  good  model.  The  classification  accuracy  for  test  sets  was  consistently  100%. 

The  result  in  Figure  6  illustrate  the  discriminative  power  of  MCHMM  for  this  data  set.  It  also 
demonstrates  the  effect  of  annealing:  The  testing  data  likelihood  is  maximal  at  iteration  6  (thus 
the  optimal  pis  p  =  0.53),  beyond  which  the  MCHMM  starts  overfitting  the  data.  In  our  experi¬ 
ments,  the  optimal  stopping  point  was  consistently  found  within  ±1  step,  using  a  single,  indepen- 
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Fignre  6:  Log  score  of  testing  data  as  a  fimction  of  the  iteration  of  EM,  for  the  synthetic  data  set.  The  top  curve  shows 
the  log  score  of  data  generated  hum  the  model  used  in  training,  whereas  the  bottom  curve  shows  the  log  score  for  data 
generated  from  a  different  model.  95%  confidence  bars  are  also  shown. 


dently  generated  sequence  for  cross-validation.  In  all  our  experiments,  the  MCHMM  consistently 
discriminated  the  two  different  classes  without  error,  regardless  of  the  settings  of  the  individual 
parameters  (we  did  not  run  experiments  with  N  <  10).  The  bars  in  Figure  6  are  confidence  bars 
at  the  95%  confidence  level. 

Figure  7  shows  the  corresponding  result  for  the  more  difficult  gesture  recognition  problem. 
These  results  were  obtained  using  a  single  gesture  for  training  only,  and  using  50  gestures  for 
testing.  The  top  curve  depicts  the  log  score  for  gestures  drawn  in  the  same  direction  as  the  training 
gesture,  whereas  the  bottom  curve  shows  the  log  scores  for  gestures  drawn  in  opposite  direction. 
As  easily  noticed  in  Figure  7,  the  difference  between  classes  is  smaller  than  in  Figure  6 — ^which 
comes  at  no  surprise — yet  the  score  of  the  “correct”  class  is  still  a  factor  of  4  to  5  larger  than 
that  of  the  opposite  class.  Figure  7  also  shows  the  effect  of  annealing.  The  best  shrinkage  value 
is  obtained  after  12  iterations,  where  p  =  0.28.  As  in  the  previous  case,  cross-validation  using  a 
single  gesture  performs  well.  On  average,  the  classification  accuracy  when  using  cross-validation 
for  early  stopping  is  86.0%.  This  rate  is  remarkably  high,  given  that  only  a  single  gesture  per  class 
was  used  for  training. 

A  key  advantage  of  MCHMM  over  HMMs  lies  in  the  ease  of  trading  off  computational  com¬ 
plexity  and  accurecy.  In  conventional  HMM,  the  computational  complexity  is  fixed  for  any  given 
(trained)  model.  In  contrast,  MCHMM  permit  variable  sampling  set  sizes  during  run-time,  after 
the  model  is  trained.  This  leads  to  a  nice  any-time  algorithm  [9, 58].  Once  trained,  a  MCHMM  can 
control  its  computational  complexity  on-line  by  dynamically  adjusting  the  sample  size,  depending 
on  the  available  resources.  The  any-time  property  is  important  for  many  real-world  applications, 
where  computation  resources  are  bounded.  To  the  best  of  our  knowledge,  the  any-time  nature  is  a 
unique  property  of  MCHMMs,  which  is  not  shared  by  previous  HMM  algorithms. 
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Figure  7:  Log  score  for  the  gesture  data  base,  obtained  for  independent  testing  data.  The  top  curve  shows  the  log  score 
of  gestures  drawn  in  same  same  way  as  the  training  gesture,  whereas  the  bottom  curve  shows  the  log  score  of  gestures 
drawn  in  opposite  direction. 


Figure  8  illustrates  the  trade-off  between  computation  and  accuracy  empirically  for  the  gesture 
data  base.  Shown  there  is  the  trade-off  between  the  number  of  samples  and  the  likelihood  of  the 
testing  data.  Notice  that  the  horizontal  axis  is  logarithmic.  All  of  these  points  are  generated  using 
a  model  A  obtained  using  early  stopping.  The  sample  set  size  was  generated  after  training,  to 
investigate  the  effect  of  computational  limitations  on  the  on-line  performance  of  the  MCHMM. 
As  in  Figure  7,  the  top  curve  in  Figure  7  depicts  the  log  score  of  gestures  drawn  in  the  same 
direction  as  the  training  data,  whereas  the  bottom  curve  shows  the  log  score  of  gestures  drawn  in 
opposite  direction. 

The  result  in  Figure  8  illustrates  that  the  score  (and  hence  the  accuracy)  of  both  data  sets  in¬ 
creases  with  the  sample  set  size.  This  is  not  surprising,  as  the  accuracy  of  the  approximations 
increases  with  the  sample  set  size.  In  addition.  Figure  8  suggests  that  the  distance  between  clock¬ 
wise  and  counterclockwise  gestures  is  approximately  constant  in  log  score  space  (hence  it  grows  in 
score).  This  illustrates  that  better  results  are  achieved  with  larger  sample  sets.  In  our  experiments, 
however,  even  small  sample  sets  yielded  good  discrimination  (after  training  with  larger  sample 
sets).  For  example,  we  observed  on  average  16%  classification  error  with  A  =  10  samples.  The 
processing  of  such  sample  sets  is  several  orders  of  magnitude  faster  than  the  hand  motion  involved 
in  gesture  generation.  This  makes  MCHMMs  extremely  fast  for  on-line  tracking  and  discrimina¬ 
tion  in  this  domain. 

Following  suggestions  by  Koller^,  we  evaluated  the  utility  of  smoothing  in  the  context  of 
MCHMMs  [13,  24].  Smoothing  transforms  those  sample  sets  that  represent  posteriors  over  states 
(a  and  /?)  into  more  compact  representations  such  as  density  trees,  which  are  then  used  for 
likelihood-weighted  sampling  instead  of  the  original  samples.  To  apply  smoothing  to  MCHMMs, 
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Figure  8:  Trade-off  between  sample  set  size  and  log  score,  for  the  gesture  data  set  The  more  samples  are  available,  the 
higher  the  score,  and  hence  the  more  accurate  the  results.  However,  even  extremely  small  sample  sets  yield  reasonable 
discrimination. 


the  basic  learning  algorithm  (see  Table  1)  was  modified  as  follows; 

Step  3a:  Generate  N  samples  (x',  p^i)  from  the  tree  representing  at-i  using  likelihood- weighted 
sampling. 

Step  4a:  Generate  N  samples  (x',  p^i)  from  the  tree  representing  using  likelihood-weighted 
sampling. 

Notice  that  this  is  a  minor  modification  of  the  basic  algorithm,  as  trees  are  readily  computed 
for  the  various  densities  a  and  /3;  but  now  they  are  also  used  for  likelihood- weighted  sampling. 
Obviously,  smoothing  reduces  the  variance  of  the  various  a  and  /?  at  the  cost  of  increasing  bias.  It 
has  been  suggested  that  smoothing  with  density  trees  can  further  improve  performance  [24]. 

Unfortunately,  in  our  experiments  we  were  unable  to  confirm  the  utility  of  smoothing  in  the 
context  of  MCHMMs.  For  the  artificial  data  set,  the  generalization  performance  of  MCHMMs 
with  smoothing  was  indistinguishable  from  that  obtained  without  smoothing,  both  in  terms  of  log 
score  of  the  testing  data  and  the  class  discrimination  accuracy.  Smoothing  actually  worsened  the 
performance  for  the  gesture  data  set.  Figure  9  shows  the  log  score  of  testing  data  using  smoothing 
(black  curve),  and  compares  it  with  the  corresponding  log  scores  obtained  without  smoothing 
(gray  curve,  copied  from  Figure  7).  Smoothing  was  found  to  reduce  the  log  score  of  the  testing 
data;  thus,  MCHMMs  trained  with  smoothing  are  less  capable  of  “explaining”  new  examples. 
This  is  not  surprising,  as  smoothing  is  not  an  information  loss-free  operation,  and  it  is  not  clear 
that  the  inductive  bias  provided  by  a  tree  representation  is  beneficial.  The  reduction  of  the  log 
score  led  to  a  slight  reduction  of  the  discrimination  accuracy.  Using  cross-validation,  the  average 
classification  accuracy  for  MCHMMs  with  smoothing  was  82.0%,  which  is  4.0%  smaller  than  the 
accuracy  obtained  without  smoothing. 
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Figure  9:  MCHMMs  with  (black  curve)  and  without  (grey  curve)  density  tree  smoothing  of  a  and  /3,  under  otherwise 
equal  conditions.  Each  curve  plots  the  log  score  of  the  testing  data  as  a  function  of  the  number  of  Baum-Welch  iterations. 
The  grey  curve  is  equivalent  to  the  top  curve  in  Figure  7.  These  results  illustrate  the  negative  effect  of  smoothing  on 
log  score  of  the  testing  data. 


8  Related  Work 

Hidden  Markov  Models  [2,  43]  have  been  successfully  applied  to  a  huge  range  of  applications 
requiring  temporal  signal  processing.  As  indicated  in  the  introduction  of  this  paper,  most  state-of- 
the-art  speech  recognition  systems  rely  on  HMMs  (see,  for  example,  the  various  papers  in  [54]). 
Recently,  extensions  of  HMMs  have  been  successfully  applied  in  the  context  of  mobile  robotics 
[23, 48, 52],  where  they  have  led  to  improved  solutions  of  the  concurrent  mapping  and  localization 
problem,  which  is  generally  acknowledged  as  one  of  the  most  difficult  problems  in  robotics  [6, 25]. 
These  are  just  two  examples  of  successful  application  areas  of  HMM;  the  number  of  successful 
applications  is  numerous. 

Most  HMM  algorithms  differ  from  MCHMMs  in  that  they  only  apply  to  discrete  state  and 
observation  spaces.  While  the  majority  of  HMM  approaches  represent  states  individually  (in  a 
flat  way),  some  more  recent  approaches  have  extended  HMMs  to  loosely  coupled  factorial  repre¬ 
sentations  [15,  46],  which  represent  state  by  features  but  are  nevertheless  discrete.  Others  have 
successfully  proposed  HMM  models  that  can  cope  with  real- valued  observation  spaces  [29,  19]. 
Our  approach  generalizes  HMMs  to  continuous  state  and  observation  spaces  with  a  large,  non- 
parametric  range  of  state  transition  and  observation  densities.  It  also  differs  from  previous  HMM 
approaches  in  that  it  provides  a  mechanisms  for  complexity  control  (regularization);  in  previous 
approaches,  the  complexity  of  the  internal  state  space  had  to  be  calibrated  carefully  by  hand.  Fi¬ 
nally,  MCHMMs  provide  a  mechanisms  to  trade  off  computational  requirements  and  accuracy  in 
an  any-time  fashion,  which  differs  from  previous  approaehes  whose  computational  requirements 
were  not  adjustable  during  run-time  [9,  58].  We  envision  that  all  of  these  advances  are  important 
for  a  broad  range  of  practical  problems. 
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Sampling  techniques,  one  of  the  two  methods  used  for  density  approximation  in  MCHMMs, 
have  recently  gained  popularity  in  the  applied  statistics  and  AI  literature.  Various  researchers  have 
applied  sampling  methods  in  the  context  of  state  estimation  in  dynamical  systems  [3, 13,  22,  30, 
41]  and  learning  [7] .  A  nice  introduction  into  the  use  of  sampling  for  density  approximation  can  be 
found  in  [36].  Our  approach  to  state  estimation  (computation  of  the  as)  is  essentially  equivalent 
to  the  condensation  algorithm  proposed  by  Isard  and  Blake  [18]  and  the  Markov  localization 
algorithm  proposed  by  Dellaert  and  colleagues  [11,  10],  both  of  which  are  basically  versions  of 
the  well-known  sampling/importance  resampling  (SIR)  algorithm  [44, 49].  Similar  approaches  are 
known  as  particle  filters  [41],  bootstrap  [14],  and  survival  of  the  fittest  [22].  All  these  approaches 
are  concerned  with  state  estimation  in  an  HMM-like  or  Kalman  filter-like  fashion.  Thus,  they  rely 
on  the  a  priori  availability  of  p  and  u,  which  are  learned  from  data  by  the  MCHMM  algorithm 
proposed  here.  To  our  knowledge,  the  application  of  sampling-based  methods  to  learning  non- 
parametric  state  transition  models  and  observation  models  in  HMMs  is  new,  as  is  our  proposal  for 
the  integration  of  a  forward  (a)  and  backward  (/?)  phase  using  trees.  The  theoretical  results  in  this 
paper  demonstrate  that  our  approach  can  be  applied  to  a  large  class  of  problems,  assuming  that 
sufficiently  many  samples  are  used.  Trees  have  frequently  been  used  for  density  approximation, 
most  recently  in  [24,  35, 38,  39].  However,  we  are  not  aware  of  a  proof  of  asymptotic  consistency 
for  density  trees,  although  we  suspect  that  such  a  result  exists. 

9  Conclusion 

We  have  presented  a  new  algorithm  for  hidden  Markov  models,  called  Monte  Carlo  Hidden 
Markov  Models  (MCHMM).  MCHMMs  extend  HMMs  to  real- valued  state  and  observation  spaces. 
They  represent  all  densities  by  samples,  which  are  transformed  into  probability  density  functions 
using  density  trees.  Both  representations  were  proven  to  be  asymptotically  consistent  under  mini¬ 
mal  assumptions  on  the  nature  of  the  density  that  is  being  approximated.  Because  the  continuous 
state  spaces  are  rich  enough  to  fit  (and  over-fit)  arbitrary  data  sets,  our  approach  uses  shrinkage 
to  reduce  its  complexity.  The  shrinkage  parameter  is  gradually  aimealed  down  over  time,  and 
cross-validation  (early  stopping)  is  used  to  select  the  best  model  complexity.  The  pervasive  use  of 
Monte  Carlo  sampling  led  to  the  design  of  an  any-time  implementation,  capable  of  dynamically 
trading  off  computational  complexity  and  the  accuracy  of  the  results.  Consequently,  MCHMMs 
are  well-suited  for  time-critical  applications  that  are  frequently  encountered  in  the  real  world. 

We  have  proved  the  asymptotic  consistency  of  MCHMMs  for  a  large  class  of  probability  den¬ 
sity  functions.  Empirical  results,  carried  out  in  an  artificial  domain  and  a  more  challenging  ges¬ 
ture  recognition  domain,  demonstrate  the  our  approach  generalizes  well  even  when  trained  with 
extremely  scarce  data.  Additional  experiments  characterize  the  natural  trade-off  between  sample 
set  size  and  accuracy,  illustrating  that  good  results  may  be  achieved  even  from  extremely  small 
sample  sets. 

While  the  theoretical  results  of  this  paper  hold  only  in  the  limit,  actual  applications  have  to  live 
with  finite  sample  sets.  Thus,  an  interesting  and  open  question  is  under  what  conditions  MCHMM 
with  finite  sample  sets  converges,  and  what  error  to  expect.  While  principal  error  bounds  of  this 
type  are  relatively  easy  to  obtain  in  static  approximation  setting,  the  recursive  nature  of  Baum- 
Welch  makes  it  difficult  to  obtain  bounds  for  MCHMM.  The  empirical  results  described  here. 
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however,  suggest  that  MCHMM  is  a  robust  method  that  works  well  in  practice  even  with  relatively 
small  sample  sets. 

We  conjecture  that  MCHMMs  are  better  suited  for  many  real-world  application  domains  such 
as  speech  and  robotics  applications  than  conventional  HMMs,  for  primarily  three  reasons:  their 
support  of  continuous  representations  of  observations  and  state,  their  built-in  mechanisms  for 
model  selection,  which  reduces  the  burden  of  picking  the  “right”  model  (e.g.,  number  of  states 
in  conventional  HMMs),  and,  finally,  their  support  of  any-time  computation,  which  makes  them 
extremely  compliant  in  time-critical  applications  with  bounded  computational  resources. 
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